Author: “Purba Roy”
We will need the following R packages.
# Load standard libraries
library(tidyverse)
library(nycflights13)
Here, we will use the data on all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013. We find this data in the nycflights13 R package.
# Load the nycflights13 library which includes data on all
# flights departing NYC
data(flights)
# Note the data itself is called flights. It already ships as a tibble, so
# as_tibble() (the replacement for the deprecated tbl_df()) is kept only
# for readability
flights <- as_tibble(flights)
# Look at the help file for information about the data
?flights
## starting httpd help server ... done
flights
#view(flights)
# summary(flights)
The help file shows that the data was collected from RITA, the Bureau of Transportation Statistics, and covers all flights that departed from the three New York airports (JFK, LGA and EWR) in 2013. The dataset has 19 variables. year, month and day give the date of departure. dep_time is the actual departure time in hour-minute format (HHMM or HMM), and sched_dep_time is the scheduled departure time in the same format. dep_delay is the difference between the actual and scheduled departure times in minutes; negative values mean the flight left early. Similarly, arr_time, sched_arr_time and arr_delay are the actual arrival time, the scheduled arrival time and the arrival delay in minutes.
carrier is the two-letter abbreviation of the airline. flight and tailnum are the flight number and the tail number of the plane. origin and dest are the departure and destination airports. air_time is the time spent in the air in minutes, and distance is the distance between the two airports in miles. hour and minute split the scheduled departure time into its two parts, and time_hour gives the scheduled date and hour of departure as a date-time.
head(flights,9)
The head function shows the first nine rows of the data in tabular form. The full dataset contains 336776 flights that departed from New York in 2013, as the str output below confirms.
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
I found that the variables year, month and day are stored as integers rather than as a date. The only variable stored as a date-time is time_hour.
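If a proper date type is needed, the three integer columns can be combined into one. The following is a minimal sketch using base R's ISOdate; the names flights_dated and dep_date are assumptions made for illustration and are not part of the dataset.

```r
library(dplyr)
library(nycflights13)

# Build a Date column from the integer year, month and day columns.
# flights_dated and dep_date are hypothetical names used only in this sketch.
flights_dated <- flights %>%
  mutate(dep_date = as.Date(ISOdate(year, month, day)))

# Check the new column's type and the period it covers
class(flights_dated$dep_date)
range(flights_dated$dep_date)
```

Keeping the original integer columns alongside the new one means the later month-based plots still work unchanged.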
summary(flights)
## year month day dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907
## Median :2013 Median : 7.000 Median :16.00 Median :1401
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400
## NA's :8255
## sched_dep_time dep_delay arr_time sched_arr_time
## Min. : 106 Min. : -43.00 Min. : 1 Min. : 1
## 1st Qu.: 906 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124
## Median :1359 Median : -2.00 Median :1535 Median :1556
## Mean :1344 Mean : 12.64 Mean :1502 Mean :1536
## 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945
## Max. :2359 Max. :1301.00 Max. :2400 Max. :2359
## NA's :8255 NA's :8713
## arr_delay carrier flight tailnum
## Min. : -86.000 Length:336776 Min. : 1 Length:336776
## 1st Qu.: -17.000 Class :character 1st Qu.: 553 Class :character
## Median : -5.000 Mode :character Median :1496 Mode :character
## Mean : 6.895 Mean :1972
## 3rd Qu.: 14.000 3rd Qu.:3465
## Max. :1272.000 Max. :8500
## NA's :9430
## origin dest air_time distance
## Length:336776 Length:336776 Min. : 20.0 Min. : 17
## Class :character Class :character 1st Qu.: 82.0 1st Qu.: 502
## Mode :character Mode :character Median :129.0 Median : 872
## Mean :150.7 Mean :1040
## 3rd Qu.:192.0 3rd Qu.:1389
## Max. :695.0 Max. :4983
## NA's :9430
## hour minute time_hour
## Min. : 1.00 Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 9.00 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :13.00 Median :29.00 Median :2013-07-03 10:00:00
## Mean :13.18 Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:17.00 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :23.00 Max. :59.00 Max. :2013-12-31 23:00:00
##
This gave me the distribution of values for each numeric and date-time variable; character variables only report their length. From it we find that the mean departure delay over the entire year was 12.64 minutes.
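The same figure can be pulled out directly instead of being read off the summary table. This is a small sketch with dplyr; the object name delay_summary and its column names are chosen here for illustration.

```r
library(dplyr)
library(nycflights13)

# Mean departure delay over the year, plus the number of flights
# with no recorded departure delay (these are skipped by na.rm = TRUE)
delay_summary <- flights %>%
  summarise(
    mean_dep_delay    = mean(dep_delay, na.rm = TRUE),
    missing_dep_delay = sum(is.na(dep_delay))
  )
delay_summary
```

Reporting the number of missing values next to the mean makes it clear how many flights the average leaves out.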
The first question I would ask is: in which month is the total delay (departure plus arrival delay) the highest, and why? I find this interesting because it shows how flight delays vary over the year. Does the weather in a particular month affect the delay, or is there some other cause? To understand the delay further: which carrier has the highest delays? Is a delay due to the carrier or to the weather conditions in a given month?
The second question I found interesting is the relation between origin airport and carrier. This will help me understand which airline operates most frequently from which New York airport.
For each of the questions we proposed above, we perform an exploratory data analysis designed to address the question.
For the first question, I plotted total delay (departure plus arrival delay) against month. The smoothed curve is highest in July and dips substantially in October and November, meaning delays were lower in those months. So it is possible that weather conditions in particular months, or factors such as overbooking or extra flights around certain occasions, affect the delay.
# Total delay per flight in minutes; this adds a 20th column to flights
flights$totalDelay <- flights$dep_delay + flights$arr_delay
ggplot(data = flights) +
  geom_smooth(mapping = aes(x = month, y = totalDelay), na.rm = TRUE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
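Since month only takes twelve integer values, a table of monthly means can serve as a cross-check on the smoothed curve. The sketch below recomputes the total delay on a fresh copy of the data; total_delay, monthly_delay and the summary column names are assumptions made for this example.

```r
library(dplyr)
library(nycflights13)

# Mean total delay (departure + arrival) per month, highest first.
# Flights with a missing departure or arrival delay are excluded from the mean.
monthly_delay <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%
  group_by(month) %>%
  summarise(
    mean_total_delay = mean(total_delay, na.rm = TRUE),
    flights_counted  = sum(!is.na(total_delay))
  ) %>%
  arrange(desc(mean_total_delay))
monthly_delay
```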
To examine the relation between carrier and delay, I drew a boxplot. The plot is very spread out, however, and tells us little about the median values.
ggplot(data = flights) +
geom_boxplot(mapping = aes(x = carrier, y = totalDelay), na.rm=TRUE)
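Because the medians are hard to read off the boxplot, a per-carrier table may be a useful complement. This is one possible sketch; carrier_delay and its column names are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(nycflights13)

# Median total delay per carrier, highest first, with the number of
# flights each carrier operated so small carriers are easy to spot
carrier_delay <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%
  group_by(carrier) %>%
  summarise(
    median_total_delay = median(total_delay, na.rm = TRUE),
    n_flights          = n()
  ) %>%
  arrange(desc(median_total_delay))
carrier_delay
```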
For the second question, I explored the relation between airlines and origin airport. I used the facet_wrap function to draw one panel per carrier, so that each relation can be read individually. The panels showed me that at JFK, JetBlue (B6) is the most used airline compared to the others.
ggplot(data=flights, aes(x=origin, group=carrier, fill=carrier)) +
geom_density(adjust=1.5) +
facet_wrap(~carrier)
I then combined the individual panels into one plot to get a better comparative understanding of the relation. By overlapping the densities, I can posit that carriers depart from JFK with comparatively higher frequency.
ggplot(data=flights, aes(x=origin, group=carrier, fill=carrier)) +
geom_density(adjust=1.5, alpha=.4)
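Both origin and carrier are categorical, so counting flights per combination is an alternative to a density estimate. The following sketch is not the approach used above; it tabulates the counts and draws them as a dodged bar chart. The names origin_carrier, top_by_origin and n_flights are assumptions made for this example.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Number of flights for every origin/carrier combination
origin_carrier <- count(flights, origin, carrier, name = "n_flights")

# Carrier with the most departures at each origin airport
top_by_origin <- origin_carrier %>%
  group_by(origin) %>%
  slice_max(n_flights, n = 1) %>%
  ungroup()
top_by_origin

# One bar per carrier, grouped by origin airport
ggplot(origin_carrier, aes(x = origin, y = n_flights, fill = carrier)) +
  geom_col(position = "dodge")
```

A count table also gives exact numbers, which the density plots can only suggest.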
How many flights out of NYC are there in the data?
dim(flights)
## [1] 336776 20
# Ans: 336776
# dim gives the number of rows and columns. Each row is one flight departing NYC;
# the 20th column is the totalDelay variable added earlier.
How many NYC airports are included in this data? Which airports are these?
length(unique(flights$origin))
## [1] 3
# Ans: 3 NYC airports are included in the data for departure: EWR, LGA and JFK. We used the unique function to get the distinct values of the departure airport.
Into how many airports did the airlines fly from NYC in 2013?
length(unique(flights$dest))
## [1] 105
# Ans: The airlines flew into 105 airports. We used the unique function on the dest column to get the distinct destination airports.
How many flights were there from NYC to Seattle (airport code SEA)?
p <- dim(filter (flights, dest == "SEA"))
p
## [1] 3923 20
# Ans: There were 3923 flights to Seattle. We used the filter function to extract the matching rows.
Were there any flights from NYC to Spokane (GAG)?
GAG <- dim(filter (flights, dest == "GAG"))
GAG
## [1] 0 20
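# Ans: No, there weren't any, as the result has 0 rows. Note that the query uses the code GAG as given in the question; Spokane International's IATA code is GEG.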
Checking if there are any destinations that do not look like valid airport codes (i.e. three-letter-all-upper case)?
# A valid airport code is exactly three upper-case letters
valid <- str_detect(flights$dest, "^[A-Z]{3}$")
# A destination is invalid if it is missing or does not match the pattern.
# The conditions must be combined with | (or). Joining them with & would
# require a code to be missing AND lower case AND non-alphabetic at the
# same time, which no row can satisfy.
DestinationCode <- filter(flights, is.na(dest) | !valid)
DestinationCode
# ANS: 0 values with invalid airport codes
What is the typical delay of flights in this data?
mean(flights$arr_delay[flights$arr_delay>0], na.rm=TRUE)
## [1] 40.3425
#flights %>% summarise(mean(arr_delay, na.rm = TRUE))
#mean(flights$totalDelay[flights$totalDelay>0], na.rm=TRUE)
# ANS: typical arrival delay = 40.3425 minutes, averaged over the flights that arrived late (arr_delay > 0)
Which ones are the worst three destinations from NYC if we don’t like flight delays?
#sort(flights$arr_delay,decreasing=TRUE)
flights[order(flights$arr_delay, decreasing = TRUE),c("arr_delay","dest")]
# Ans: HNL, CMH, ORD. These are the destinations of the three individual flights with the largest arrival delays.
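Sorting individual flights ranks destinations by their single worst flight. Another reading of the question is to rank destinations by their average arrival delay, which the following sketch does. This is an alternative interpretation, not the method used above, and worst_dest and its column names are hypothetical.

```r
library(dplyr)
library(nycflights13)

# Mean arrival delay per destination, keeping the three highest.
# n_flights is reported because averages over very few flights are unstable.
worst_dest <- flights %>%
  group_by(dest) %>%
  summarise(
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    n_flights      = n()
  ) %>%
  arrange(desc(mean_arr_delay)) %>%
  head(3)
worst_dest
```

The two approaches need not agree, since one very late flight says little about how a destination performs on a typical day.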
How many flights were there from NYC airports to Portland in 2013?
p <- dim(filter (flights, dest == "PDX"))
p
## [1] 1354 20
# Ans: 1354
How many airlines fly from NYC to Portland?
#grp_by <- group_by(flights,carrier)
unique(flights$carrier[flights$dest=="PDX"])
## [1] "DL" "UA" "B6"
Which are these airlines (find the 2-letter abbreviations)? How many times did each of these go to Portland?
gr <- group_by(flights, carrier,dest)
gr
p <- summarise(gr, count=n())
p
newdata <- flights[ which(flights$dest=='PDX'
& flights$carrier =="DL"), ]
dim(newdata)
## [1] 458 20
# Ans: 458
newdata <- flights[ which(flights$dest=='PDX'
& flights$carrier =="UA"), ]
dim(newdata)
## [1] 571 20
# Ans: 571
newdata <- flights[ which(flights$dest=='PDX'
& flights$carrier =="B6"), ]
dim(newdata)
## [1] 325 20
# Ans: 325
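The three separate subsets can also be obtained in one step. As a compact sketch, filtering to Portland and counting by carrier gives all three numbers at once; pdx_by_carrier and n_flights are names chosen here for illustration.

```r
library(dplyr)
library(nycflights13)

# Number of flights to Portland (PDX) per carrier
pdx_by_carrier <- flights %>%
  filter(dest == "PDX") %>%
  count(carrier, name = "n_flights")
pdx_by_carrier
```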
How many different airplanes arrived from each of the three NYC airports to Portland?
p <-unique(flights$origin)
p
## [1] "EWR" "LGA" "JFK"
p <-unique(flights$tailnum[flights$dest=="PDX" & flights$origin=="JFK"])
length(p)
## [1] 195
p <-unique(flights$tailnum[flights$dest=="PDX" & flights$origin=="EWR"])
length(p)
## [1] 297
p <-unique(flights$tailnum[flights$dest=="PDX" & flights$origin=="LGA"])
length(p)
## [1] 0
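The per-airport counts can likewise be computed in a single grouped summary. In this sketch, planes_to_pdx and n_planes are hypothetical names. Like length(unique(...)), n_distinct counts a missing tail number as one distinct value, and an origin with no Portland flights simply does not appear in the result.

```r
library(dplyr)
library(nycflights13)

# Number of distinct tail numbers flying to Portland from each origin airport
planes_to_pdx <- flights %>%
  filter(dest == "PDX") %>%
  group_by(origin) %>%
  summarise(n_planes = n_distinct(tailnum))
planes_to_pdx
```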
What percentage of flights to Portland were delayed at departure by more than 15 minutes?
p <- filter(flights, dep_delay > 15, dest=="PDX")
count(p)
l <- filter(flights, dest=="PDX")
count(l)
count(p) / count(l)*100
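The same percentage can be computed as a single number rather than a one-cell data frame. The sketch below keeps the same denominator as above (all flights to Portland, including those with no recorded departure delay); pdx, n_late and pct_late are names assumed for this example.

```r
library(dplyr)
library(nycflights13)

# All flights to Portland
pdx <- filter(flights, dest == "PDX")

# Flights delayed at departure by more than 15 minutes;
# flights with a missing dep_delay are not counted as late
n_late <- sum(pdx$dep_delay > 15, na.rm = TRUE)

# Share of all Portland flights, in percent
pct_late <- 100 * n_late / nrow(pdx)
pct_late
```

Dividing by sum(!is.na(pdx$dep_delay)) instead would restrict the denominator to flights that actually departed, which gives a slightly different figure.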
#graphical:
ggplot(data = flights) +
geom_smooth(mapping = aes(x = month, y = arr_delay, colour=origin))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).
ggplot(data = flights) +
geom_histogram(mapping = aes(x = month), binwidth = 0.5)
#Tabular:
grp_by <- group_by(flights,month)
#summarise(grp_by,delay = mean(dep_delay, na.rm = TRUE))
ggplot(data = flights) +
  geom_histogram(mapping = aes(x = month), binwidth = 0.1) +
  geom_smooth(mapping = aes(x = month, y = arr_delay))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).
flights %>%
count(month)
head(flights,100)
summary(flights[order(flights$arr_delay, decreasing = TRUE),c("arr_delay","dest","month")])
## arr_delay dest month
## Min. : -86.000 Length:336776 Min. : 1.000
## 1st Qu.: -17.000 Class :character 1st Qu.: 4.000
## Median : -5.000 Mode :character Median : 7.000
## Mean : 6.895 Mean : 6.549
## 3rd Qu.: 14.000 3rd Qu.:10.000
## Max. :1272.000 Max. :12.000
## NA's :9430